In this paper, we study a sequential decision problem faced by e-commerce operators: when to dispatch vehicles from a central warehouse to serve customer requests, and in which order to serve them, given that parcel arrivals at the warehouse are stochastic and dynamic. The objective is to maximize the number of parcels that can be delivered within the service period. We propose two reinforcement learning approaches for solving this problem, one based on policy function approximation (PFA) and the other on value function approximation (VFA). Both approaches are combined with a lookahead strategy in which future release dates are sampled in a Monte Carlo fashion and a tailored batching method is used to approximate the value of future states. Our PFA and VFA make good use of a branch-and-bound-based exact method to improve decision quality. We also establish sufficient conditions for a partial characterization of the optimal policy and integrate it into PFA/VFA. In an empirical study based on 720 benchmark instances, we conduct a competitive analysis using upper bounds with perfect information and show that PFA and VFA greatly outperform two alternative myopic approaches. Overall, PFA provides the best solutions, while VFA, which benefits from a two-stage stochastic optimization model, achieves a better trade-off between solution quality and computation time.
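The Monte Carlo lookahead described above, which samples future release dates to evaluate a dispatch decision, can be sketched in miniature. Everything below is illustrative: the function names, the Bernoulli-per-time-step arrival model, and the single-vehicle, single-trip evaluation are assumptions for the sketch, not the paper's actual method.

```python
import random

def sample_release_dates(horizon, rate, rng):
    """Sample stochastic future parcel release times (toy arrival model:
    independently, a parcel is released at each time step with prob. `rate`)."""
    return sorted(t for t in range(horizon) if rng.random() < rate)

def deliverable_count(dispatch_time, releases, trip_duration, horizon):
    """Toy evaluation: parcels released by `dispatch_time` leave with this trip;
    the trip counts only if it finishes within the service period."""
    if dispatch_time + trip_duration > horizon:
        return 0
    return sum(1 for r in releases if r <= dispatch_time)

def lookahead_dispatch_value(dispatch_time, horizon, rate, trip_duration,
                             n_samples=200, seed=0):
    """Monte Carlo estimate of the expected number of parcels delivered
    if the vehicle is dispatched at `dispatch_time`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        releases = sample_release_dates(horizon, rate, rng)
        total += deliverable_count(dispatch_time, releases, trip_duration, horizon)
    return total / n_samples

# A PFA-style rule for this toy model: pick the dispatch time with the
# highest estimated value over the sampled futures.
best = max(range(24), key=lambda t: lookahead_dispatch_value(
    t, horizon=24, rate=0.5, trip_duration=6))
```

In this toy model, waiting accumulates more parcels, so the best dispatch time is the latest one that still lets the trip finish within the horizon; the real problem is harder because requests must also be sequenced and multiple trips interact.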
Earthquakes, fires, and floods often cause structural collapses of buildings. However, the inspection of damaged buildings poses a high risk for emergency forces or is even impossible. We present three recent selected missions of the Robotics Task Force of the German Rescue Robotics Center, in which both ground and aerial robots were used to explore destroyed buildings. We describe and reflect on the missions as well as the lessons learned that have resulted from them. In order to make robots from research laboratories fit for real operations, realistic test environments were set up for outdoor and indoor use and tested in regular exercises by researchers and emergency forces. Based on this experience, the robots and their control software were significantly improved. Furthermore, joint teams of researchers and first responders were formed, each with realistic assessments of the operational and practical suitability of robotic systems.
Strategic test allocation plays a major role in the control of both emerging and existing pandemics (e.g., COVID-19, HIV). Widespread testing supports effective epidemic control by (1) reducing transmission via identifying cases, and (2) tracking outbreak dynamics to inform targeted interventions. However, infectious disease surveillance presents unique statistical challenges. For instance, the true outcome of interest (one's positive infectious status) is often a latent variable. In addition, the presence of both network and temporal dependence reduces the data to a single observation. As testing entire populations regularly is neither efficient nor feasible, standard approaches to testing recommend simple rule-based testing strategies (e.g., symptom-based testing, contact tracing) without taking individual risk into account. In this work, we study an adaptive sequential design involving $n$ individuals over a period of $\tau$ time-steps, which allows for unspecified dependence among individuals and across time. Our causal target parameter is the mean latent outcome we would have obtained after one time-step if, starting at time $t$ given the observed past, we had carried out a stochastic intervention that maximizes the outcome under a resource constraint. We propose an Online Super Learner for adaptive sequential surveillance that learns the optimal choice of test strategies over time while adapting to the current state of the outbreak. Relying on a series of working models, the proposed method learns across samples, through time, or both, based on the underlying (unknown) structure in the data. We present an identification result for the latent outcome in terms of the observed data, and demonstrate the superior performance of the proposed strategy in a simulation modeling a residential university environment during the COVID-19 pandemic.
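The resource-constrained intervention above can be illustrated with a deliberately simple stand-in: with a fixed test budget, test the individuals with the highest estimated infection risk. The paper's actual estimator (an Online Super Learner over working models) is far more involved; the greedy rule and all names below are assumptions for illustration only.

```python
def allocate_tests(risk_scores, budget):
    """Greedy stand-in for a resource-constrained testing intervention:
    return the indices of the `budget` individuals with the highest
    estimated infection risk."""
    ranked = sorted(range(len(risk_scores)),
                    key=lambda i: risk_scores[i], reverse=True)
    return sorted(ranked[:budget])

# Five individuals with (hypothetical) estimated risks; budget of two tests.
risks = [0.05, 0.40, 0.10, 0.75, 0.20]
chosen = allocate_tests(risks, budget=2)
```

Here the two highest-risk individuals (indices 3 and 1) are selected; an adaptive design would re-estimate the risks after each round of test results and reallocate accordingly.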
Accurate speed estimation of road vehicles is important for several reasons. One is speed limit enforcement, which represents a crucial tool in decreasing traffic accidents and fatalities. Compared with other research areas and domains, the number of available datasets for vehicle speed estimation is still very limited. We present a dataset of on-road audio-video recordings of single vehicles passing by a camera at known speeds, kept stable by the on-board cruise control. The dataset contains thirteen vehicles, selected to be as diverse as possible in terms of manufacturer, production year, engine type, power, and transmission, resulting in a total of 400 annotated audio-video recordings. The dataset is fully available and intended as a public benchmark to facilitate research in audio-video vehicle speed estimation. In addition to the dataset, we propose a cross-validation strategy that can be used when training machine learning models for vehicle speed estimation. Two approaches to the training-validation split of the dataset are proposed.
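With per-vehicle recordings, one natural split keeps each vehicle's recordings entirely in either the training or the validation fold, so a model is always validated on unseen vehicles. The sketch below is a generic leave-one-vehicle-out split under that assumption, not necessarily one of the two protocols the dataset proposes.

```python
def leave_one_vehicle_out(recordings):
    """Yield (held_out_vehicle, train_ids, val_ids) splits where each
    vehicle's recordings are held out in turn.
    `recordings` maps vehicle id -> list of recording ids."""
    for held_out in recordings:
        train = [r for v, recs in recordings.items()
                 if v != held_out for r in recs]
        val = list(recordings[held_out])
        yield held_out, train, val

# Hypothetical recording ids for three vehicles.
data = {"car_a": ["a1", "a2"], "car_b": ["b1"], "car_c": ["c1", "c2", "c3"]}
splits = list(leave_one_vehicle_out(data))
```

Each of the three splits validates on one vehicle's recordings and trains on the rest, so no vehicle's audio-video signature leaks between folds.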
In this paper, we build on advances introduced by the Deep Q-Networks (DQN) approach to extend the multi-objective tabular Reinforcement Learning (RL) algorithm W-learning to large state spaces. The W-learning algorithm can naturally resolve the competition between multiple single policies in multi-objective environments. However, the tabular version does not scale well to environments with large state spaces. To address this issue, we replace the underlying Q-tables with DQNs and propose the addition of W-Networks as a replacement for the tabular W-value representations. We evaluate the resulting Deep W-Networks (DWN) approach on two widely accepted multi-objective RL benchmarks: deep sea treasure and multi-objective mountain car. We show that DWN resolves the competition between multiple policies while outperforming a DQN baseline. Additionally, we demonstrate that the proposed algorithm can find the Pareto front in both tested environments.
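The W-learning arbitration that DWN approximates with networks can be sketched in its tabular form: each per-objective policy nominates its greedy action, and the policy with the highest W-value for the current state wins. The list-of-lists encoding below is an illustrative assumption, not the paper's implementation.

```python
def w_learning_action(q_values, w_values, state):
    """Tabular W-learning arbitration. `q_values[i]` is policy i's Q-table
    (state x action), `w_values[i]` its W-table (state,). The policy with
    the highest W in `state` wins and its greedy action is executed."""
    winner = max(range(len(w_values)), key=lambda i: w_values[i][state])
    qs = q_values[winner][state]
    action = max(range(len(qs)), key=lambda a: qs[a])
    return winner, action

# Two policies over a single state with three actions; policy 1 has the
# higher W-value here ("more to lose"), so its preferred action is taken.
q = [[[1.0, 0.2, 0.1]],
     [[0.0, 0.5, 2.0]]]
w = [[0.3], [0.9]]
winner, action = w_learning_action(q, w, state=0)
```

DWN's contribution, as the abstract describes it, is to replace both lookup tables with function approximators (DQNs for Q, W-Networks for W) so that this arbitration scales to large state spaces.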
The interpretability of reinforcement learning (RL) policies remains a challenging research problem, particularly when RL is considered in safety-critical settings. Understanding the decisions and intentions of an RL policy offers a route to incorporating safety into the policy by restricting undesirable actions. We propose the use of a Boolean decision rule model to create a post-hoc rule-based summary of an agent's policy. We evaluate our approach with a DQN agent trained on a lava gridworld and show that, given a hand-crafted feature representation of this gridworld, simple generalized rules can be created that provide an interpretable post-hoc summary of the agent's policy. We discuss possible avenues for introducing safety into an RL agent's policy by using the rules generated by this rule model as constraints imposed on the agent's policy, and how creating simple rule summaries of an agent's policy may help in the debugging process.
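To make the idea of a Boolean rule summary concrete: such a summary maps hand-crafted Boolean features of a state to an action via a short list of conjunctive rules. The rules, feature names, and first-match semantics below are a hand-written toy example of the kind of artifact the method extracts, not rules learned by the paper's approach.

```python
def matches(rule, features):
    """A rule is a conjunction of (feature, required_value) literals."""
    return all(features.get(f) == v for f, v in rule)

def rule_summary_action(rules, features, default="forward"):
    """Evaluate a Boolean decision rule list: the first rule whose
    conjunction is satisfied determines the action."""
    for rule, action in rules:
        if matches(rule, features):
            return action
    return default

# Toy post-hoc summary of a lava-gridworld policy.
summary = [
    ([("lava_ahead", True)], "turn_left"),
    ([("goal_ahead", True), ("lava_ahead", False)], "forward"),
]
```

A rule list this small can be audited by hand, and a literal such as `("lava_ahead", True) -> "turn_left"` could in principle be enforced as a safety constraint on the underlying policy, which is the avenue the abstract discusses.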
Deep learning models for medical image segmentation can fail unexpectedly and spectacularly on pathological cases and on images acquired at centers different from those of the training images, with labeling errors that violate expert knowledge. Such errors undermine the trustworthiness of deep learning models for medical image segmentation. Mechanisms to detect and correct such failures are essential for safely translating this technology into clinics, and are likely to be a requirement of future regulations on artificial intelligence (AI). In this work, we propose a trustworthy AI theoretical framework and a practical system that can augment any backbone AI system using a fallback method and a fail-safe mechanism based on Dempster-Shafer theory. Our approach relies on an actionable definition of trustworthy AI. Our method automatically discards voxel-level labels predicted by the backbone AI that violate expert knowledge and relies on a fallback for those voxels. We demonstrate the effectiveness of the proposed trustworthy AI approach on the largest reported annotated dataset of fetal MRI, consisting of 540 manually annotated fetal brain 3D T2w MRIs from 13 centers. Our trustworthy AI method improves the robustness of a state-of-the-art backbone AI for fetal brain MRIs acquired across various centers and for fetuses with various brain abnormalities.
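The Dempster-Shafer machinery the fail-safe mechanism builds on can be illustrated with Dempster's rule of combination, which fuses two independent sources of evidence over the same set of labels. The two mass functions below (a "backbone AI" source and an "expert knowledge" source over two labels) are invented for illustration; the paper's actual combination scheme is more elaborate.

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions over subsets
    of a frame of discernment (subsets encoded as frozensets)."""
    combined = {}
    conflict = 0.0
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            inter = s1 & s2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2  # mass assigned to incompatible labels
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    # Renormalize by the non-conflicting mass.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# Hypothetical voxel-level evidence over labels {"brain", "background"}.
m_backbone = {frozenset({"brain"}): 0.8,
              frozenset({"brain", "background"}): 0.2}
m_expert = {frozenset({"background"}): 0.6,
            frozenset({"brain", "background"}): 0.4}
fused = dempster_combine(m_backbone, m_expert)
```

Note how strong disagreement between the sources (mass 0.48 on incompatible labels here) is discarded and the rest renormalized; a fail-safe system can treat high conflict as a signal to distrust the backbone's label and invoke the fallback.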
In complex tasks where the reward function is not straightforward and consists of a set of objectives, multiple reinforcement learning (RL) policies that perform the task adequately, but employ different strategies, can be trained by adjusting the impact of individual objectives on the reward function. Understanding the differences in strategies between policies is necessary to enable users to choose between offered policies, and can help developers understand the different behaviors that emerge from various reward functions and training hyperparameters in RL systems. In this work, we compare the behavior of two policies trained on the same task, but with different preferences over objectives. We propose a method for distinguishing differences in behavior that stem from different abilities from those that are a consequence of the preferences of the two RL agents. Furthermore, we use only the data on preference-based differences in order to generate contrastive explanations about the agents' preferences. Finally, we test and evaluate our approach on an autonomous driving task, comparing the behavior of a safety-oriented policy with one that prefers speed.
Reinforcement learning (RL) has been used in a range of simulated real-world tasks, e.g., sensor coordination, traffic light control, and on-demand mobility services. However, real-world deployments are rare, as RL struggles with the dynamic nature of real-world environments, requiring time to learn a task and to adapt to changes in the environment. Transfer learning (TL) can help reduce these adaptation times. In particular, there is significant potential for applying TL in multi-agent RL systems, where multiple agents can share knowledge with each other, as well as with new agents that join the system. To achieve the most from inter-agent transfer, transfer roles (i.e., determining which agents act as sources and which as targets) as well as the relevant transfer content parameters (e.g., transfer size) should be selected dynamically in each particular situation. As a first step towards fully dynamic transfers, in this paper we investigate the impact of TL transfer parameters with fixed source and target roles. Specifically, we label every agent's experience with the agent's epistemic confidence, and we filter the shared examples using different threshold levels and sample sizes. We investigate the impact of these parameters in two scenarios: a standard predator-prey RL benchmark and a simulation of a ride-sharing system with 200 vehicle agents and 10,000 ride requests.
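The confidence-based filtering of shared experiences can be sketched as follows. The experience encoding, the "keep the most confident first" tie-breaking, and the function name are assumptions made for this sketch; the abstract only specifies that shared examples are filtered by a confidence threshold and capped by a sample size.

```python
def filter_shared_experiences(experiences, threshold, sample_size):
    """Keep only experiences whose source agent labeled them with epistemic
    confidence at or above `threshold`, then cap the transfer at
    `sample_size` items, highest confidence first."""
    confident = [e for e in experiences if e["confidence"] >= threshold]
    confident.sort(key=lambda e: e["confidence"], reverse=True)
    return confident[:sample_size]

# Hypothetical (state, action, reward, next_state) transitions offered
# by a source agent, each tagged with that agent's confidence.
shared = [
    {"transition": ("s0", "a1", 1.0, "s1"), "confidence": 0.9},
    {"transition": ("s1", "a0", 0.0, "s2"), "confidence": 0.4},
    {"transition": ("s2", "a2", 0.5, "s3"), "confidence": 0.7},
]
kept = filter_shared_experiences(shared, threshold=0.5, sample_size=1)
```

Sweeping `threshold` and `sample_size`, as the paper does, trades off the quality of transferred experience against its volume: a high threshold with a small cap transfers little but reliable knowledge, while loose settings transfer more but noisier examples.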
Combining multi-site data can strengthen and uncover trends, but it is a task marred by the influence of site-specific covariates that can bias the data and, therefore, any downstream analyses. Post-hoc multi-site correction methods exist, but they rely on strong assumptions that often do not hold in real-world scenarios. Algorithms should be designed in a way that accounts for site-specific effects, such as those arising from sequence parameter choices, and, in instances where generalization fails, should be able to identify such failures by means of explicit uncertainty modeling. This body of work showcases such an algorithm, which can become robust to the physics of acquisition in the context of segmentation tasks while simultaneously modeling uncertainty. We demonstrate that our method not only generalizes to fully held-out datasets, preserving segmentation quality, but does so while also accounting for site-specific sequence choices, which additionally allows it to act as a harmonization tool.